Loan Data from Prosper by Yusuf Britton

This data set covers over 113,000 borrowers who sought loans through Prosper. The analysis looks at variables that may affect a borrower’s APR or Prosper rating. It uses 12 of the 81 original variables, plus one derived variable (AnnualIncome).

Univariate Plots Section

## [1] 113937     13
## 'data.frame':    113937 obs. of  13 variables:
##  $ Term                 : Ord.factor w/ 3 levels "12"<"36"<"60": 2 2 2 2 2 3 2 2 2 2 ...
##  $ LoanStatus           : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerAPR          : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ ProsperRating..Alpha.: Ord.factor w/ 8 levels ""<"HR"<" E"<"D"<..: 1 7 1 7 4 6 NA 5 8 8 ...
##  $ BorrowerState        : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ IsBorrowerHomeowner  : logi  TRUE FALSE FALSE TRUE TRUE TRUE ...
##  $ CreditScoreRangeLower: int  640 680 480 800 680 740 680 700 820 820 ...
##  $ BankcardUtilization  : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ DebtToIncomeRatio    : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ StatedMonthlyIncome  : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount   : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ MonthlyLoanPayment   : num  330 319 123 321 564 ...
##  $ AnnualIncome         : num  37000 73500 25000 34500 115000 ...
##  Term                       LoanStatus     BorrowerAPR     
##  12: 1614   Current              :56576   Min.   :0.00653  
##  36:87778   Completed            :38074   1st Qu.:0.15629  
##  60:24545   Chargedoff           :11992   Median :0.20976  
##             Defaulted            : 5018   Mean   :0.21883  
##             Past Due (1-15 days) :  806   3rd Qu.:0.28381  
##             Past Due (31-60 days):  363   Max.   :0.51229  
##             (Other)              : 1108   NA's   :25       
##  ProsperRating..Alpha. BorrowerState   IsBorrowerHomeowner
##         :29084         CA     :14717   Mode :logical      
##  C      :18345         TX     : 6842   FALSE:56459        
##  B      :15581         NY     : 6729   TRUE :57478        
##  A      :14551         FL     : 6720                      
##  D      :14274         IL     : 5921                      
##  (Other):12307                : 5515                      
##  NA's   : 9795         (Other):67493                      
##  CreditScoreRangeLower BankcardUtilization DebtToIncomeRatio
##  Min.   :  0.0         Min.   :0.000       Min.   : 0.000   
##  1st Qu.:660.0         1st Qu.:0.310       1st Qu.: 0.140   
##  Median :680.0         Median :0.600       Median : 0.220   
##  Mean   :685.6         Mean   :0.561       Mean   : 0.276   
##  3rd Qu.:720.0         3rd Qu.:0.840       3rd Qu.: 0.320   
##  Max.   :880.0         Max.   :5.950       Max.   :10.010   
##  NA's   :591           NA's   :7604        NA's   :8554     
##  StatedMonthlyIncome LoanOriginalAmount MonthlyLoanPayment
##  Min.   :      0     Min.   : 1000      Min.   :   0.0    
##  1st Qu.:   3200     1st Qu.: 4000      1st Qu.: 131.6    
##  Median :   4667     Median : 6500      Median : 217.7    
##  Mean   :   5608     Mean   : 8337      Mean   : 272.5    
##  3rd Qu.:   6825     3rd Qu.:12000      3rd Qu.: 371.6    
##  Max.   :1750003     Max.   :35000      Max.   :2251.5    
##                                                           
##   AnnualIncome     
##  Min.   :       0  
##  1st Qu.:   38404  
##  Median :   56000  
##  Mean   :   67296  
##  3rd Qu.:   81900  
##  Max.   :21000035  
## 

From the summary statistics, we have 113,937 observations of 13 variables (12 from the original data plus the derived AnnualIncome).

One thing to note is that DebtToIncomeRatio (where 1 should be the maximum), StatedMonthlyIncome, and MonthlyLoanPayment have outliers.

Terms

Looking at the terms, 36 months is the most common loan term.

LoanStatus

Most of the loans are current or completed. Some are charged off or defaulted, while a small number are past due or in final payment.

BorrowerAPR

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229      25

There is a spike in APR around 36%. APR is one of the main variables of interest, and the rest of the analysis looks at what affects it. The mean APR is about 22% (median about 21%).

ProsperRating (Alpha)

##          HR     E     D     C     B     A    AA  NA's 
## 29084  6935     0 14274 18345 15581 14551  5372  9795

Of the 113,937 records, 29,084 have a blank rating; the data frame is subset to exclude the blanks. From the data, AA is the highest rating and HR is the lowest.

Also, C is the most common rating.
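The table above shows zero E ratings and 9,795 NA values, while the str() output lists the level as " E" with a leading space. One plausible reading (an assumption, since the original code is not shown) is that the level label did not match the raw value "E", so those ratings became NA. A minimal sketch on a toy vector:

```r
# Toy vector of ratings as they might appear in the raw column (illustrative values).
raw <- c("HR", "E", "D", "E", "AA", "")

# Levels copied from the str() output, including the leading space in " E".
bad_levels  <- c("", "HR", " E", "D", "C", "B", "A", "AA")
# The same levels with the space removed (assumed to be the intended labels).
good_levels <- c("", "HR", "E", "D", "C", "B", "A", "AA")

bad  <- factor(raw, levels = bad_levels,  ordered = TRUE)
good <- factor(raw, levels = good_levels, ordered = TRUE)

# With the mismatched level, every "E" becomes NA and the " E" level stays empty.
print(table(bad, useNA = "ifany"))
print(table(good, useNA = "ifany"))
```

If this is what happened, correcting the level label would restore the E ratings and remove the unexpected NA values.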

BorrowerState

Of the 113,937 records, 5,515 have a blank state; the data frame is subset to exclude the blanks. Borrowers are mostly from California. Texas, New York, Florida, and Illinois follow, each with less than half of California’s count. This may reflect the larger populations of these states and their big cities.

IsBorrowerHomeowner

##    Mode   FALSE    TRUE 
## logical   56459   57478

There are 57,478 homeowners and 56,459 borrowers who don’t own a home.

CreditScoreRangeLower

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591
## 
##     0   360   420   440   460   480   500   520   540   560   580   600 
##   133     1     5    36   141   346   554  1593  1474  1357  1125  3602 
##   620   640   660   680   700   720   740   760   780   800   820   840 
##  4172 12199 16366 16492 15471 12923  9267  6606  4624  2644  1409   567 
##   860   880 
##   212    27

For the lower bound of the credit score range, borrowers are concentrated around 660 to 700.

BankcardUtilization

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.310   0.600   0.561   0.840   5.950    7604

Per the variable definitions, bankcard utilization is a ratio between 0 and 1, so anything above 1 is treated as an error.

Visually, more than half of borrowers have more than 50% utilization.

DebtToIncomeRatio

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

Per the variable definitions, DebtToIncomeRatio is a ratio between 0 and 1, so anything above 1 is treated as an error.

Most of the borrowers’ debt-to-income ratios are between 25% and 30%.

Annual Income

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0    38404    56000    67296    81900 21000035

Used 99% quantile to remove outliers.
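A minimal sketch of trimming at the 99th percentile, assuming the cutoff was applied by keeping values at or below `quantile(x, 0.99)`. The original code is not shown, so the variable names are illustrative and the toy vector simply reuses a few income values printed earlier in this report.

```r
# Toy data: a few StatedMonthlyIncome values from the output above, including the maximum.
income <- c(3083, 6125, 2083, 2875, 9583, 4667, 5608, 1750003)

# 99th percentile of the non-missing values (default quantile type 7).
cutoff <- quantile(income, probs = 0.99, na.rm = TRUE)

# Keep only observations at or below the cutoff; missing values are dropped here.
trimmed <- income[!is.na(income) & income <= cutoff]

print(summary(trimmed))
```

On this toy vector the single extreme value is removed and the remaining values are left unchanged.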

Most borrowers’ monthly income is around $4,500 to $5,000, and annual income is mostly around $40,000 to $60,000.

LoanOriginalAmount

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Used 99% quantile to remove outliers.

Borrowers usually borrow around $4,000, $10,000, or $15,000. I wonder if the higher amounts are for homeowners?

MonthlyLoanPayment

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2251.5

Used 99% quantile to remove outliers.

Most borrowers pay around $150 a month.

What is the structure of your dataset?

There are 113,937 loans in the dataset with 13 features (Term, Loan Status, Borrower APR, Prosper Rating, Borrower State, Borrower Homeownership, Credit Score, Bankcard Utilization, Debt to Income Ratio, Stated Monthly Income, Loan Original Amount, Monthly Loan Payment, and Annual Income).

The Prosper rating variable is an ordered factor with the following levels.

(worst) ——> (best) Prosper rating: HR, E, D, C, B, A, AA

Other observations:

  • Mean APR is about 22%, with a spike around 36%
  • C is the most common Prosper rating
  • Average credit score is about 685
  • Average bank card utilization is 56%
  • Average debt-to-income ratio is about 28%

What is/are the main feature(s) of interest in your dataset?

The main features of interest are Borrower APR and Prosper rating. I would like to determine which features affect APR. I expect that the Prosper rating, as well as other variables, affects the borrower’s interest rate.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think Prosper rating, credit score, bankcard utilization, debt-to-income ratio, annual income, and home ownership may have an effect on the borrower’s APR.

Did you create any new variables from existing variables in the dataset?

I created the Annual income variable
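The report does not show how AnnualIncome was computed. The printed values are consistent with StatedMonthlyIncome multiplied by 12 (for example, 6125 and 73500), so the sketch below assumes that definition; the data frame name `loans` is hypothetical.

```r
# Two StatedMonthlyIncome values that are printed without rounding in the str() output.
loans <- data.frame(StatedMonthlyIncome = c(6125, 2875))

# Assumed definition: annual income is twelve times the stated monthly income.
loans$AnnualIncome <- loans$StatedMonthlyIncome * 12

print(loans)
```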

Of the features you investigated, were there any unusual distributions?

I noticed that bankcard utilization and debt-to-income ratio have maximum values higher than 1.

According to the variable dictionary, these values are ratios from 0 to 1.

I subsetted the data to exclude the outliers.
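One way such a subset could be written is sketched below. Keeping rows with missing values is an assumption made for this example, and the data frame `loans` and helper `in_range` are hypothetical names; the toy values are taken from the output earlier in this report.

```r
# Toy rows: a few ratio values from the output above, including both maxima.
loans <- data.frame(
  BankcardUtilization = c(0, 0.21, NA, 0.81, 5.95),
  DebtToIncomeRatio   = c(0.17, 0.18, 0.06, 10.01, 0.26)
)

# TRUE for values in [0, 1]; missing values are kept (assumption) rather than dropped.
in_range <- function(x) is.na(x) | (x >= 0 & x <= 1)

# Drop rows where either ratio is above 1.
cleaned <- subset(loans, in_range(BankcardUtilization) & in_range(DebtToIncomeRatio))

print(cleaned)
```

Without the explicit `is.na()` check, `subset()` would also drop rows where the ratio is missing, which changes the row count considerably given the number of NA values in these columns.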

Bivariate Plots Section

From the analysis, borrowers with a higher Prosper rating tend to have a lower APR.

Comparing Prosper rating by state, I see concentrations in CA, FL, GA, IL, NY, OH, and TX. I’ll look into these states further.

From research, the possible credit score range is 300 to 850, so I set limits to this range.

Visually, a high credit score has some effect on APR, but there is a wide range of APRs concentrated between credit scores of 640 and 725.

When comparing across states, credit scores span a wider range in states with highly populated cities.
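As a rough check on how much data falls outside the 300 to 850 range, the counts from the CreditScoreRangeLower table earlier in this report can be tallied directly. This sketch only reuses the printed counts; how the limits were applied in the original plots is not shown.

```r
# Score values and counts copied from the CreditScoreRangeLower frequency table.
scores <- c(0, 360, seq(420, 880, by = 20))
counts <- c(133, 1, 5, 36, 141, 346, 554, 1593, 1474, 1357, 1125, 3602,
            4172, 12199, 16366, 16492, 15471, 12923, 9267, 6606, 4624, 2644, 1409, 567,
            212, 27)
stopifnot(length(scores) == length(counts))

# Flag the score values that fall inside the 300 to 850 range.
in_range <- scores >= 300 & scores <= 850

inside  <- sum(counts[in_range])   # records kept by the limits
outside <- sum(counts[!in_range])  # records at 0, 860, or 880

print(c(inside = inside, outside = outside))
```

Only a few hundred of the non-missing records fall outside the limits, so restricting the range removes very little data.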

There are probably other factors affecting this.

Visually, bankcard utilization does not seem to have a major effect on APR. One thing to note is that, regardless of utilization, there is a concentration around 36% APR.

Comparing bankcard utilization to credit score, borrowers with lower utilization tend to have higher credit scores.

Visually, debt-to-income ratio does not seem to have a major effect on APR. One thing to note is that, regardless of the ratio, there is a concentration around 36% APR.

Comparing home ownership to debt-to-income ratio, there doesn’t seem to be much of a difference.

Visually, owning a home does not seem to have a major effect on APR. One thing to note is that, regardless of whether a borrower owns a home, there is a concentration around 36% APR.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation.

APR correlates with Prosper rating and credit score. The Prosper rating looks like it groups borrowers into ranges of credit score.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

What I found interesting was the comparison of Prosper ratings across states. I will look at how CA, FL, GA, IL, NY, OH, and TX compare to one another in the multivariate analysis.

What was the strongest relationship you found?

Between APR and Prosper Grade

Multivariate Plots Section

Each state follows a similar breakdown between APR and Prosper rating. CA and TX have a higher concentration of borrowers with less than 10% APR. In IL, some HR-rated borrowers were able to get better rates.

Looking at the credit scores, some HR-rated borrowers got better rates than others.

Otherwise, borrowers with better credit scores tend to have a better APR and Prosper rating.

As bank card utilization goes up, you see less AA rating and more HR ratings.

For the ratings, most AA borrowers are in the 0 to 40 percent utilization range. HR reaches almost 100%. All the other ratings are around 60%.

Interestingly, more homeowners have AA ratings.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation.

Comparing variables to Prosper rating and APR, each variable followed a relationship similar to what was explored in the bivariate plots.

Were there any interesting or surprising interactions between features?

I found it surprising that homeowners had higher Prosper ratings than non-homeowners.


Final Plots and Summary

Plot One

## # A tibble: 7 x 3
##   BorrowerState     n   freq
##   <fct>         <int>  <dbl>
## 1 CA            14717 0.294 
## 2 TX             6842 0.136 
## 3 NY             6729 0.134 
## 4 FL             6720 0.134 
## 5 IL             5921 0.118 
## 6 GA             5008 0.0999
## 7 OH             4197 0.0837
## # A tibble: 7 x 3
##   BorrowerState     n   freq
##   <fct>         <int>  <dbl>
## 1 CA            14717 0.129 
## 2 TX             6842 0.0601
## 3 NY             6729 0.0591
## 4 FL             6720 0.0590
## 5 IL             5921 0.0520
## 6 GA             5008 0.0440
## 7 OH             4197 0.0368

Description One

What I found interesting about this is the intensity of ratings in different states. The states with the higher concentrations may stand out because they contain some of America’s biggest cities.

In this set, CA makes up 29% of the borrowers in the 7 states chosen for the analysis and 13% of all borrowers in the dataset.
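The two freq columns in the tables above can be reproduced from the printed counts. This sketch assumes the first table divides by the total of the seven selected states and the second by all 113,937 records, which matches the printed values.

```r
# Borrower counts per state, copied from the tibble output.
n <- c(CA = 14717, TX = 6842, NY = 6729, FL = 6720, IL = 5921, GA = 5008, OH = 4197)

# Total number of records in the dataset.
total_rows <- 113937

# Share within the seven selected states, and share of the whole dataset.
share_of_selected <- n / sum(n)
share_of_all      <- n / total_rows

print(round(share_of_selected, 3))
print(round(share_of_all, 3))
```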

Plot Two

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591

Description Two

Though some of the borrowers had perfect credit scores, they still had a high APR.

Most of the borrowers’ credit scores ranged from 660 to 720.

Plot Three

## # A tibble: 2 x 4
## # Groups:   IsBorrowerHomeowner [2]
##   IsBorrowerHomeowner ProsperRating..Alpha.     n Percentage
##   <lgl>               <ord>                 <int>      <dbl>
## 1 TRUE                AA                     3847     0.0669
## 2 FALSE               AA                     1525     0.0270

Description Three

Homeowners generally had more AA ratings. It could be that homeownership plays a big part in getting a better rating.

There are 3,847 homeowners with an AA rating and 1,525 non-homeowners with this rating.
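The Percentage column above can be checked against the homeowner counts from the univariate section. The sketch assumes each AA count is divided by the total number of borrowers in that homeownership group, which reproduces the printed values.

```r
# AA-rated borrowers by homeownership, from the tibble above.
aa <- c(homeowner = 3847, non_homeowner = 1525)

# Group sizes from the IsBorrowerHomeowner summary.
group_size <- c(homeowner = 57478, non_homeowner = 56459)

# Share of each group that holds an AA rating.
aa_share <- aa / group_size
print(round(aa_share, 4))

# How many times more likely a homeowner is to hold an AA rating.
aa_ratio <- unname(aa_share["homeowner"] / aa_share["non_homeowner"])
print(aa_ratio)
```

By this measure a homeowner is roughly two and a half times as likely to hold an AA rating, although the comparison does not control for credit score or income.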


Reflection

From the analysis, I found that APR follows the Prosper rating pretty closely. Part of the struggle was trying to find relationships, only to realize that some of the information isn’t complete. For example, I would have loved to look at rating by occupation, but quickly saw that the “Professional” value made up most of the variable. I found that there was a lot of missing information.

I was surprised by the impact homeownership has on securing an AA rating and a decent APR.

I would like to revisit this analysis with a more complete data set that gives a better breakdown of occupation, as well as city-level data for the states that have the most borrowers.